ReCIDE: robust estimation of cell type proportions by integrating single-reference-based deconvolutions

Abstract In this study, we introduce Robust estimation of Cell type proportions by Integrating single-reference-based DEconvolutions (ReCIDE), an innovative framework for robust estimation of cell type proportions by integrating single-reference-based deconvolutions. ReCIDE outperforms existing approaches in benchmark and real datasets, particularly excelling in estimating rare cell type proportions. Through exploratory analysis on public bulk data of triple-negative breast cancer (TNBC) patients using ReCIDE, we demonstrate a significant correlation between the prognosis of TNBC patients and the proportions of both T cell and perivascular-like cell subtypes. Built upon this discovery, we develop a prognostic assessment model for TNBC patients. Our contribution presents a novel framework for enhancing deconvolution accuracy, showcasing its effectiveness in medical research.


Introduction
Single-cell ribonucleic acid-sequencing (scRNA-seq) enables quantification of gene expression in individual cells, allowing for characterization of cellular heterogeneity within complex tissues.Despite its wide applications, scRNA-seq has been adopted to a limited extent in large-scale clinical cohort studies because of its cost [1].Although bulk RNA-seq and microarray technologies captures averaged gene expression across heterogeneous cell types in a tissue, they have been widely applied in biomedical research because of their cost-efficiency [2].A wealth of bulk RNAseq and microarray cohort data are available in publicly accessible databases such as Gene Expression Omnibus (GEO) and the Cancer Genome Atlas (TCGA) [3,4], with many complemented by valuable clinical information [5].Utilizing cell type deconvolution, a technique that estimates cell type proportions in bulk RNA-seq and microarray data by using scRNA-seq data as a reference, these extensive cohorts of clinically informative bulk datasets can be reanalyzed, from which clinically relevant cell types can be identified to aid in the prediction of disease prognosis and a deeper understanding of disease pathogenesis [6,7].
A typical deconvolution method would first obtain the expression profiles of cell type-specific marker genes, either from prior knowledge or derived directly from a reference of scRNA-seq data, and then employ a regression or machine learning-based method to estimate cell type proportions in bulk expression profiles.Representative methods include CIBERSORT [8], Fardeep [9], and DWLS [10].These methods usually do not model the variations in cell-level gene expression profiles between subjects in the reference because earlier scRNA-seq studies typically included only a few subjects.With increasing throughput and decreasing cost of scRNA-seq technology, it is now not uncommon to find dozens or even hundreds of samples included in one single scRNA-seq study.Accordingly, recently developed methods start to incorporate cross-subject variations in cell-level gene expression profiles into their methodologies.These methods include MuSiC [11], Bisque [12], and BayesPrism [13].Additionally, as multiple scRNA-seq studies become available for the same tissue of origin, methods such as SCDC-ENSEMBLE were developed to integrate the results produced by each single-cell reference set [14].A recently developed method-interRD-further integrated multiple single-cell reference sets and prior knowledge about cell type-specific marker genes and the population-level mean proportion of each cell type in a deconvolution scheme [15].
For current deconvolution methods, a major challenge in deconvolution is the batch differences between bulk and singlecell reference data, which are likely produced by both technical and biological reasons [12].The incorporation of cross-subject variations and the integration of multiple scRNA-seq reference sets further exacerbate the situation.To date, most deconvolution methods attempted to solve these problems within one statistical framework.However, accurately addresses all these intertwined factors with one statistical framework is not an easy task [14].An alternative approach that can take advantage of the increasing number of available subjects and reference sets without adding extra complexity is to apply a single reference-based method, such as CIBERSORT or DWLS, to conduct deconvolution separately by using each subject's single-cell scRNA-seq data as a reference, and then integrate the deconvolution results afterwards.Thus, here we propose a general framework called ReCIDE (Robust estimation of Cell type proportions by Integrating single-referencebased DEconvolutions), which introduces an innovative approach for robustly estimating cell type proportions by integrating singlereference-based deconvolutions.Given multiple deconvolution results produced by different scRNA-seq references, ReCIDE conducts dimension reduction and clustering to get highconfidence deconvolution results.Additionally, ReCIDE includes a module for identifying the marker gene matrix (MGM) based on the scRNA-seq references by optimizing the cosine similarity between the MGM and the bulk dataset, as a way to minimize the batch effect between the reference and target bulk datasets.
In our unbiased benchmarks, the ReCIDE-optimized deconvolution methods showed significantly improved deconvolution accuracy over the corresponding single reference-based deconvolution methods used to produce the initial deconvolution results, and ReCIDE-optimized DWLS consistently outperformed the other ReCIDE-optimized methods.Moreover, ReCIDE-optimized DWLS demonstrated superior accuracy over current multi-reference-based methods, such as MuSiC, Bisque, and BayesPrism, and multi-reference set-based methods, such as SCDC-ENSEMBLE.The primary advantage of ReCIDE-DWLS over existing methods is its outstanding performance in estimating the proportions of rare cell types (comprising less than 2% of the total cell population).This superiority is further validated through evaluations using real bulk RNA-seq and microarray datasets, where ReCIDE-DWLS successfully identified rare cell types with known clinical significance, while other competing methods failed to do so.
Finally, ReCIDE-optimized DWLS was applied to the bulk RNAseq and microarray datasets of triple-negative breast cancer (TNBC) patients, revealing between subclasses of T cells and perivascular-like (PVL) cells, both of which are rare cell types, and the prognosis of TNBC patients, demonstrating its clinical relevance.Based on this, a prognostic evaluation model was established using the proportions of PVL cells and their subgroups, which was used to assess the prognosis of TNBC patients.ReCIDE is therefore of compelling interest for reanalyzing large cohort of clinical bulk RNA-seq data and uncovering clinically significant cell types.ReCIDE is available as an open-source R package at https://github.com/TianLab-Bioinfo/ReCIDE.

The workflow of ReCIDE
The ReCIDE workf low comprises three primary steps: constructing and refining individual references' cell type-specific MGMs, deconvolution using each MGM, and integrating the deconvolution outcomes.
Step 1 involves preprocessing the scRNA-seq reference matrices (counts) along with subject and cell type labels.ReCIDE first organizes the reference matrix into subject-specific ones based on subject labels.Each subject-specific matrix undergoes marker gene selection using the cosg() function from the COSG R package, initially picking 100 marker genes for each cell type [16].Subsequently, a filtering process, as recommended by Cobos et al. [17], is applied based on the average expression fold change between the cell type with the highest and second-highest expression (Sec-ondFC).Marker genes with SecondFC greater than 1.5 are retained.The filtered reference matrix subset, with marker genes as rows and cells as columns, is then used to build the MGM matrix via the buildSignatureMatrixUsingSeurat() function in DWLS, aiming for the highest stability following Newman et al.'s method [8].ReCIDE further optimizes the MGM based on the correlation between reference expression profiles of cell type-specific marker genes and the target bulk expression profile.Specifically, ReCIDE sorts the marker genes of each cell type in ascending order according to their SecondFC values and then proceeds to assess each marker gene in the list.If removing a marker gene result in an improvement in the similarity score between the expression profile of the marker genes and the target bulk gene expression profile, it is removed; otherwise, it is retained.Here, the similarity score is calculated as the product of the number of marker genes for a specific cell type and the cosine similarity between the expression profiles of those marker genes and the target bulk gene expression profiles.
Step 2 involves deconvolution with individual references.ReCIDE uses each subject's MGM obtained in Step 1 independently for deconvoluting the target bulk dataset, experimenting with FARDEEP, CIBERORT, and DWLS as deconvolution methods.After benchmark testing, the solveDampenedWLS() function of DWLS is recommended as the deconvolution kernel.
In Step 3, ReCIDE integrates and refines the predicted results from Step 2. Assuming n individual reference subjects, ReCIDE performs principal component analysis (PCA) dimensionality reduction on the n sets of deconvolution outcomes based on cell types, utilizing the fast.prcomp()function in R [18].Subsequently, the reduced-dimensional results are clustered using Gaussian mixture models by the Mclust function from the Mclust R package [19].After that, ReCIDE retains the largest class in the clustering result.For each cell type, the average of the estimated proportions from the largest class is considered the best estimate.Finally, ReCIDE normalizes the estimates across all cell types to ensure their sum equals 1.

Benchmark studies
For benchmark studies, various scRNA-seq datasets were employed to create pseudo-bulk data and conduct benchmark tests.The datasets used were as follows: 1. Leave-one-out test -Pancreatic (PAN) dataset: obtained from Fasolino et al. [20] via the interactive website https://cellxgene.cziscience.com/.

Real studies
To further assess ReCIDE-DWLS under real biological conditions and explore its broader applicability, five microarray and bulk RNA-seq datasets from COVID-19, SLE, and TNBC patients were used for additional testing: -COVID-19 dataset: bulk RNA-seq data from Matthew et al., with accession number GSE231409 in the GEO database.PBMC scRNA-seq data from Stephenson et al. [24] served as the reference set.
-SLE dataset: microarray data from Kennedy et al. [32], with accession number GSE50772 in the GEO database.PBMC scRNAseq data from Stephenson et al. served as the reference set.
In this study, all datasets used in the benchmark studies were downloaded on December 15, 2023, while those used in the real studies were downloaded on December 22, 2023.

Evaluation metrics
To comprehensively evaluate the accuracy of different deconvolution methods, we employed two metrics: root mean square error (RMSE) and Pearson correlation coefficient (PCC).RMSE quantifies the absolute error between the deconvolution results and the ground truth, providing insights into the precision of the estimation.On the other hand, PCC measures the correlation between the deconvolution outcomes and the true values, offering insights into the consistency of the predictions.RMSE was computed using the rmse() function sourced from the ModelMetrics R package [36], and PCC was determined using the cor() function from the stats R package [37].

Construction and visualization of triple negative breast cancer single-cell reference sets
The TNBC reference atlases in this study were compiled from four sets of studies (Wu et al. (GSE176078) [28], Pal et al. (GSE161529) [29], Xu et al. (GSE180286) [35], and Bassez et al. [30]).To ensure consistency in cell type annotation, we employed the SciBet v1.0 function [26] to transfer cell type labels from the study of Wu et al. to the other three datasets, using default parameters.Visualizing the reference data involved initially removing batch effects with the IntegrateData() function from the Seurat v4.4.0 R package [27].Subsequently, 2000 variable genes were utilized for PCA to reduce dimensionality, followed by the selection of the 30 most informative principal components (PCs) for clustering and visualization using uniform manifold approximation and projection (UMAP) [38,39].PCA and UMAP were executed through the RunPCA() and RunUMAP() functions from the Seurat v4.4.0 R package, respectively.

Statistical tests
The Wilcoxon test was conducted using the stat_compare_means() function from the ggpubr R package [40,41].The pairwise_survdiff() and ggsurvplot() functions from the survminer R package were used for plotting Kaplan-Meier plots and log-rank tests [42].Fisher's test was conducted using the fisher.test()function from the stats R package [37].

CIBERSORT
CIBERSORT was obtained from https://github.com/Moonerss/CIBERSORT and utilized following the instructions provided at https://github.com/Moonerss/CIBERSORT/blob/main/README.md.The inputs included the bulk data and MGM.MGM construction began by initially selecting 100 marker genes for each cell type using the cosg() function from the COSG R package [16].Marker genes with a SecondFC (Second Fold Change) greater than 1.5 were retained, and the mean expression values of marker genes for each cell type were utilized to construct the MGM.All other parameters were left at their default settings.

DWLS
The DWLS R package v0.1.0was obtained from https://github.com/dtsoucas/DWLS and utilized by following the tutorials provided at https://github.com/dtsoucas/DWLS/blob/master/Manual.docx.Inputs for DWLS included the bulk data, raw counts gene expression matrix of reference scRNA-seq data, and cell type labels, with the latter being annotations from the reference scRNA-seq dataset.To enhance computational efficiency, the findmarker() function from the Seurat package, invoked by the buildSignatureMatrixUsingSeurat() function in DWLS, was substituted with the cosg() function from the COSG R package [16,27].All other parameters were maintained at their default settings.

BayesPrism
The BayesPrism R package v2.0 was obtained from https://github.com/Danko-Lab/BayesPrism and utilized by following the tutorial available at https://github.com/Danko-Lab/BayesPrism/blob/main/tutorial_deconvolution.pdf.Inputs for BayesPrism included the bulk data, the raw counts gene expression matrix of reference scRNA-seq data, cell type labels, and cell state labels.Cell type labels were derived from annotations in the reference scRNA-seq dataset.In accordance with the benchmark paper by Tran et al. [43], we opted not to utilize the cell state labels option to ensure comparability with other methods, specifying only cell types for BayesPrism.When tumor cells were present, the key parameters in the new.prism()function were configured accordingly, while all other parameters were set to the default options of the algorithm.

BisqueRNA (Bisque)
The Bisque R package v1.0.5 was obtained from https://github.com/cozygene/bisque and utilized by following the guidelines provided in the tutorials available at https://github.com/cozygene/bisque/blob/master/vignettes/bisque.Rmd.Inputs for Bisque included the bulk data, the raw counts gene expression matrix of reference scRNA-seq data, cell type labels, and reference sample labels.Cell type labels were obtained from annotations in the reference scRNA-seq dataset, while reference sample labels corresponded to sample labels from the reference dataset.All other parameters were configured to the default settings of the algorithm.

MuSiC
The MuSiC R package v1.0.0 was obtained from https://github.com/xuranw/MuSiC and implemented by following the instructions provided in the tutorials at https://xuranw.github.io/MuSiC/articles/MuSiC.html.Inputs to MuSiC included the bulk data, the raw counts gene expression matrix of reference scRNA-seq data, cell type labels, and reference sample labels.Cell type labels were derived from annotations in the reference scRNA-seq dataset, while reference sample labels corresponded to sample labels from the reference dataset.All other parameters were left at their default settings.

SCDC
The SCDC R package v0.0.0.900 was obtained from https://github.com/meichendong/SCDC and utilized following the guidelines provided in the tutorials at https://meichendong.github.io/SCDC/articles/SCDC.html.Prior to deconvolution, the raw counts gene expression matrix of the reference scRNA-seq data, along with cell type labels and reference sample labels, were fed into the SCDC_qc() function for quality control (QC).Here, the cell type labels were those annotated in the reference scRNA-seq dataset, and the reference sample labels were obtained from the reference dataset.If multiple reference datasets from different studies were available, they were processed separately using the SCDC_qc() function.Given that the SCDC_qc() function is limited to handling single-cell matrices with up to 65 536 cells at most, if the number of cells in the reference dataset exceeded this limit, 60 000 cells were randomly selected for QC.Following completion of QC, for reference subjects from the same study, the bulk data, output from the SCDC_qc() function, along with cell type labels and reference sample labels, were input into the SCDC_prop() function for deconvolution.However, if reference subjects were from different studies, the SCDC_ENSEMBLE() function was utilized for deconvolution.In this case, bulk data, output from the SCDC_qc() function, cell type labels, and reference sample labels were provided as inputs.All other parameters were left at their default settings.

inteRD
The inteRD R package v0.1.1 was obtained from https:// github.com/chencxxy28/InteRDand utilized according to the guidelines provided in the tutorials at https://chencxxy28.github.io/InteRD/articles/NAME-OF-VIGNETTE.html.To execute inteRD, the input includes bulk data, the output generated by the SCDC_ENSEMBLE() function, cell type labels, and a list of cell type-specific marker genes constructed similarly to CIBERSORT.All additional parameters were left at their default settings.

A brief description of the ReCIDE method and the evaluation of deconvolution methods
The workf low of ReCIDE comprises three primary steps: constructing and refining individual references' cell type-specific MGMs, deconvolution using each MGM, and integrating the deconvolution outcomes (Fig. 1).For detailed information about these steps, please refer to the Materials and methods section.Here, we brief ly describe these steps: Construction and refinement of MGMs: initially, ReCIDE constructs single-cell gene expression matrices for each subject derived from a reference matrix and establishes a cell typespecific MGM following standard procedures [8,17].Each MGM is further optimized to maximize the similarity score between the MGM expression profile and the target expression profile.
Deconvolution: ReCIDE iteratively employs single-referencebased deconvolution approaches such as CIBERSORT, DWLS, and FARDEEP on the target bulk expression profile using each subjectspecific MGM as a reference.
Integration of deconvolution outcomes: our major assumption is that if there exists a large cluster of cell type proportions, then the true set of cell type proportions is likely approximated by those within the cluster.Accordingly, for each cell type, ReCIDE computes the average of the corresponding proportions from the deconvolution results within the identified cluster and considers it the optimal estimation.
Finally, ReCIDE normalizes the estimations for all cell types to ensure their collective sum equals one.This comprehensive approach enhances the precision and robustness of cell type deconvolution, providing a refined and reliable estimation of cellular composition.
To assess the performance of ReCIDE, we devised four distinct benchmark scenarios.The first scenario, denoted as the "leaveone-out test", involves iteratively selecting one subject from the scRNA-seq dataset as a test subject, while the remaining subjects are employed as references for deconvolving the pseudo-bulk expression profile of the test subject.In this scenario, there are minimal batch effects between the reference subjects and the target bulk data.The second scenario, termed the "cross-dataset test", incorporates scRNA-seq reference and pseudo-bulk data sourced from different studies.This set of tests introduces more realistic batch differences in comparison to the first scenario, ref lecting diverse experimental conditions.The third scenario, labeled the "multi-studies test", combines scRNA-seq data from two distinct studies to form a reference, while pseudo-bulk data from a third study are used for testing.This design accounts for the limited number of reference subjects available from a single study and introduces batch differences across studies within the reference set.The fourth scenario, designated the "high-resolution test", annotates scRNA-seq datasets at a finer level (>40 cell types) to specifically evaluate ReCIDE's performance on rare cell types (constituting <2% of cells).This set of tests is also leave-oneout type, as it is difficult to guarantee the accuracy of cell type annotations across datasets at the high-resolution.For each of the aforementioned scenarios, we conducted two benchmark tests, ensuring a comprehensive evaluation of ReCIDE's efficacy under diverse conditions.
ReCIDE implements a single-reference-based deconvolution method in its pipeline.Here, we conducted an evaluation of three best-performing single-reference-based deconvolution methods (CIBERSORT, DWLS, and FARDEEP) recommended by Cobos et al. [17] to guide the selection of the deconvolution kernel for ReCIDE.Then, we compared ReCIDE with four multi-reference-based deconvolution methods (Bisque, BayesPrism, SCDC, and MuSiC) investigated by Chu et al. [13].InteRD is a recently published method specifically on utilizing reference datasets from multiple studies [15].It is therefore compared with ReCIDE in the scenario of "multi-studies test".

Evaluation of ReCIDE's performance on estimating cell type proportions
The selection of the deconvolution kernel for ReCIDE involved an evaluation with three popular single-reference-based methods: CIBERSORT, DWLS, and FARDEEP.The assessment focused on gaging the enhancement in deconvolution accuracy relative to the standalone use of these methods.Remarkably, all three singlereference-based deconvolution methods, which traditionally do not consider cross-subject information within the reference, consistently exhibited improved performance when integrated into ReCIDE, as evidenced by the clearly reduced RMSE measure and improved PCC (Fig. 2a, Supplementary Fig. S1a).For example, using ReCIDE for optimization reduces the average of RMSE values from 0.0518 to 0.0366 and improves the average PCC from 0.812 to 0.877 for DWLS (Fig. 2b, Supplementary Fig. S1b).As DWLS consistently exhibited the highest accuracy among the three singlereference-based methods, both before and after ReCIDE optimization, it was chosen as the deconvolution kernel in ReCIDE, subsequently named ReCIDE-DWLS.
We then conducted a comparative analysis between ReCIDE-DWLS and five multi-reference-based methods (Bisque, BayesPrism, InteRD, MuSiC, and SCDC) across four benchmark test scenarios.Note that InteRD was exclusively examined in the third scenario.Based on the RMSE measure, ReCIDE-DWLS consistently secured the top rank in all eight benchmark tests across the four scenarios (Fig. 2c-f).The use of the PCC measure further corroborated that ReCIDE-DWLS exhibited the strongest correlation with the ground truth compared to all other methods (Supplementary Fig. S1c-f).Aside from ReCIDE-DWLS, no one method perform consistently better than the other methods based on both the RMSE and the PCC measures, and the ranks of these methods differ by using different measures.
The "high-resolution test" was designed to evaluate deconvolution performance on rare cell types (constituting <2% of cells).But the RMSE measure was strongly inf luenced by errors on major cell types.To address this, we recalculated the RMSE considering only rare cell types, revealing that ReCIDE maintained its topranking position among all methods.However, the rankings of other methods underwent changes, exemplified by BayesPrism, which initially ranked fourth but moved up to second in terms of RMSE for rare cell types in the CRC dataset under this scenario (Fig. 2g).
To further inspect the performance of different methods on estimating the proportions of cell types with different sizes, we categorized cell types into three groups based on their sizes: major (constituting more than 10%), minor (more than 2% but less than 10%), and rare (constituting less than 2%).Then, for each method, we generated scatter plots of estimated versus true cell type proportions for all test subjects in the eight benchmark tests and calculated the slope of the linear regression fitting curve.A slope closer to 1 indicates more accurate estimation.The deconvolution accuracy of all methods showed a significant decrease from major to minor to rare cell types, ref lecting the heightened challenges when cell types become rare (Fig. 3a-c).Nevertheless, ReCIDE-DWLS emerged as the best method in each category, with its advantage over other methods notably increasing in scenarios involving minor and rare cell types.In rare cell types, the fitted line slope for ReCIDE-DWLS was 0.525, far outperforming other methods [the slopes range from 0.085 (MuSiC) to 0.267 (DWLS)], underscoring its exceptional performance in deconvolving rare cell types (Fig. 3c).This positions ReCIDE-DWLS as an appealing method for deconvoluting bulk data where rare cell types may hold particular clinical interest.

Validating ReCIDE-DWLS's performance using real bulk dataset
Rare cell types have been reported with clinical significance in various studies.In a scRNA-seq study on COVID-19, dendritic cells (DCs), constituting approximately 0.5%-2% in healthy individuals, were identified as significantly downregulated in COVID-19 patients compared to their healthy counterparts [44].A recent scRNA-seq study on systemic lupus erythematosus (SLE) revealed a substantial downregulation of mucosal-associated invariant T cells (MAIT) in SLE patients in comparison to healthy individuals, and a significant upregulation of proliferative T and NK cells (Prolif) [25].We therefore selected a bulk RNA-Seq dataset on COVID-19 (GSE231409), comprising 279 samples, and a microarray dataset on SLE (GSE50772) with 81 samples [32] as test datasets to evaluate the deconvolution performance on these specific cell types.The reference dataset for both tests was derived from PBMC scRNA-seq data of healthy individuals obtained from the COVID-19 study [24], comprising 23 healthy subjects.The comparative assessment included ReCIDE-DWLS, DWLS, BayesPrism, Bisque, MuSiC, and SCDC.
In the bulk COVID-19 test, ReCIDE-DWLS yielded deconvolution results for the control group that aligned with the expected levels of DCs in healthy individuals (constituting approximately 0.5%-2%) and correctly identified the downregulation of DCs in COVID-19 patients (P-value = .00087;Fig. 4a), consistent with findings from the single-cell COVID-19 study [44].All other deconvolution methods failed to correctly estimate the proportion of DCs cells in healthy individuals and failed to detect differences in the proportion of DCs between healthy individuals and COVID-19 patients.Additionally, BayesPrism and SCDC deconvolution results misidentify DCs cells proportion in patients as higher than in healthy individuals.
In the SLE microarray study, both ReCIDE-DWLS and BayesPrism estimated the levels of MAIT cells in the healthy individuals within the range reported by Perez et al. [25] and correctly identified the significant downregulation of MAIT cells in SLE patients (Fig. 4b).However, compared to BayesPrism, the deconvolution results of ReCIDE-DWLS show stronger significance between healthy individuals and SLE patients.On the other hand, in the deconvolution results of BayesPrism, a considerable proportion of SLE patients have an extremely low proportion of MAIT cells in their PBMCs (<1 × 10 −5 ), which is inconsistent with reality [25].The other methods failed to identify differences in the proportions of MAIT cells, and they underestimated the proportion of MAIT cells in control group.It is worth noting that Bisque, MuSiC, and SCDC estimated a considerable portion of the MAIT cells in both SLE patients and healthy individuals as 0. This might be attributed to significant batch effects between scRNA-seq data and microarray data, posing a challenge to all deconvolution methods.For Prolif cells, ReCIDE-DWLS not only successfully estimated the proportion of Prolif cells in both healthy individuals and SLE patients within their corresponding reported ranges but also correctly identified the significance of the difference in proportions (Fig. 4c).In contrast, although BayesPrism and SCDC correctly identified the significance of the differences, their estimations deviated from the true proportion ranges.Specifically, SCDC estimated the proportion of Prolif cells to be over 10% in healthy individuals and over 15% in SLE patients.Other methods failed to identify the significance of the difference, with MuSiC even overestimating the proportion of Prolif cells to be over 30%.Once again, in the testing based on SLE microarray data, ReCIDE-DWLS demonstrated superior performance in estimating rare cell type proportions.

Application of ReCIDE-DWLS to bulk triple negative breast cancer datasets reveals prognostic relevance of subclasses of T and perivascular-like cells
Due to the superior performance of ReCIDE-DWLS in estimating cell type proportions, we applied this method to three bulk TNBC datasets: GSE164458 (482 patients, bulk RNA-seq), GSE58812 (107 patients, microarray), and the TCGA cohort (232 patients, bulk RNA-seq) [33,34].GSE164458 comprises RNA-seq from pre-treatment biopsies of patients undergoing neoadjuvant chemotherapy, in which information related to patient response in the face of chemotherapy is reported.GSE58812 and the TCGA cohort, on the other hand, consists of patients with reported survival outcomes.Our objective with ReCIDE-DWLS deconvolution was to identify clinically relevant cell types significantly associated with treatment response and patient prognosis.We utilized a reference dataset comprising scRNAseq samples from four studies [28-30, 35, 45], totaling 39 patients (Supplementary Fig. S2).To standardize cell type annotations across studies, we adopted annotations from Wu et al. [28], encompassing 49 cell types, and utilized SciBet to transfer these labels to the scRNA-seq data in the remaining three studies [26].
Following the deconvolution of the three bulk datasets, we initially identified cell types whose proportions significantly correlate with treatment response in GSE164458 and patient prognosis in GSE58812, resulting in the identification of seven cell types (P-value <.05; Fig. 5a).Among these, six cell types (T_cells_c3_CD4 + _Tfh_CXCL13, T_cells_c6_IFIT1, T_cells_c11_MKI67, Myeloid_c0_DC_LAMP3, Myeloid_c2_LAM2_ APOE, and Myeloid_c9_Macrophage_2_CXCL10) show positive associations with TNBC patient prognosis and treatment response, all belonging to immune cells.Additionally, a significant negative correlation exists between PVL_Differentiated_s3 and TNBC treatment response and prognosis, indicating that patients with lower proportions of PVL_Differentiated_s3 are more likely to have favorable treatment responses and prognosis.Subsequently, we validated these seven cell types using the TCGA cohort, successfully confirming the significance of two cell types, T_cells_c3_CD4 + _Tfh_CXCL13 and PVL_Differentiated_s3, both significantly correlated with TNBC patient prognosis (P-value in the TCGA dataset; Fig. 5b).
Our findings are corroborated by recent studies.For example, Zhang et al. demonstrated that TNBC patients with a higher proportion of CD4+ T cells expressing CXCL13 as marker genes (ranging from 0.02% to 7% in the scRNA-seq reference set) tend to benefit from anti-PD1 treatment [46].Furthermore, elevated expression of the CXCL13 gene has been linked to improved prognosis in various cancers, including CRC and ovarian cancer [47,48].Wu et al. identified that PVL_Differentiated_s3 cells are strongly associated with immune evasion in multiple independent TNBC cohorts [49].They categorized PVL cells into three subtypes: PVL_immature_s1, PVL_immature_s2, and PVL_Differentiated_s3 based on their differentiation states.Consequently, we further investigated the correlation between PVL and its subtypes with the prognosis of TNBC patients across three bulk datasets.Similar to PVL_Differentiated_s3, there is a significant negative correlation between PVL and TNBC treatment response and prognosis (Fig. 5c, Supplementary Fig. S6a).Interestingly, although no significant association was observed for PVL_immature_s1 alone, the ratios of PVL_immature_s1/PVL proportions-indicating the proportions of PVL_immature_s1 within PVL cells-were significantly positively correlated with treatment response and survival prognosis (Fig. 5d, Supplementary Fig. S6b).A recent study by DeFilippis et al. discovered that the immature-PVL marker CD36 is enriched in normal tissue regions and is associated with a favorable survival outcome in breast cancer [50], thus supporting our findings on PVL_immature_s1.
Given the associations of T_cells_c3_CD4 + _Tfh_CXCL13, PVL, PVL_immature_s1, and PVL_Differentiated_s3 cell types with TNBC patient prognosis, we sought to develop a prognostic assessment model for TNBC based on these cell types.Utilizing the aforementioned four cell types, we devised a prognostic score calculated as the product of the T_cells_c3_CD4 + _Tfh_CXCL13 to PVL_Differentiated_s3 ratio and the PVL_immature_s1 to PVL ratio, where a higher score suggests a more favorable clinical outcome.In the GSE58812 dataset, we computed the prognostic score for each patient and found that those with scores exceeding 0.0125 exhibited the most significantly improved prognosis compared to those below this threshold (P-value = 1.83 × 10 −06 ; Fig. 5e), indicating its potential for patient stratification.We also computed prognostic scores of patients in the TCGA cohort and found that those with scores above the threshold (0.0125) had significantly longer survival times than those below (Pvalue = .00034;Fig. 5f), confirming the efficacy of our prognostic evaluation model.This underscores the capability of ReCIDE-DWLS in deciphering bulk expression profiles in TNBC patients for robust prognosis assessment.

Discussion
In this study, we introduce ReCIDE which substantially enhances deconvolution accuracy by extending single-reference-based methods like CIBERSORT and DWLS to perform deconvolutions individually for each subject's scRNA-seq data and consolidating the outcomes.Our extensive benchmark evaluation consistently demonstrates ReCIDE's superiority over other methods, with ReCIDE-DWLS particularly excelling in estimating rare cell types, which holds substantial implications for uncovering clinically relevant low-abundance cell populations associated with diseases.
The presence of batch differences between bulk and singlecell reference data, stemming from both technical and biological reasons, poses a significant challenge for cell type deconvolution [11,14].While technical batch differences may be mitigated through more sophisticated algorithms or statistical frameworks [51], biological batch differences are inherent and challenging to overcome, especially for rare cell types.ReCIDE's superior performance can be attributed to its more effective approach in addressing biological batch differences.Firstly, by processing the original reference into subject-specific references, diversity is introduced, reducing biological batch differences by increasing the likelihood of having references biologically relevant to the bulk data.Generally, ReCIDE's performance benefits from a larger reference dataset, particularly when the initial number of subjects is limited, although the rate of improvement diminishes as the number of reference samples increases further (Supplementary Fig. S7).However, since predicting the precise number of reference samples needed for saturation is challenging, it is advisable to utilize as many reference samples as feasible.Secondly, optimizing the MGM to maximize the similarity between reference and bulk data further reduces biological batch differences.Without this optimization procedure, the RMSE values of ReCIDE-DWLS in the eight benchmark tests would increase by 0.6%-10.8%(Supplementary Fig. S8), demonstrating the positive impact of the optimization process.Thirdly, clustering of the deconvolution results enriches results closer to the ground truth.Despite significant variation in each estimation from the true proportions when using different reference sets, there is no consistent bias-sometimes overestimating and other times underestimating.Averaging these estimations therefore consistently improves the final estimate's accuracy toward the true proportions.This approach is particularly effective for estimating rare cell type proportions, given that all existing methods perform poorly in this area.These simple yet effective treatments offer valuable directions for future development of deconvolution methods.
The comparison of ReCIDE-DWLS with other methods using real bulk datasets further underscores its efficacy.In both COVID-19 and SLE studies, ReCIDE-DWLS accurately estimates cell proportions for rare cell types, revealing biologically significant changes undetected by other methods.Its application to TNBC datasets uncovers the prognostic relevance of PVL cells, offering new insights into potential markers for patient outcomes.The ability to identify such associations demonstrates ReCIDE-DWLS's strength in exploring clinically relevant cell types and its potential in large-scale clinical cohort studies.
Despite its strengths, ReCIDE has some limitations.This framework relies on a sufficient number of reference samples, but the optimal sample size required by ReCIDE varies in different scenarios.Determining the adequate number of samples needed in various contexts remains to be established.Additionally, multiple reference samples are required for deconvolution, necessitating high computational resources.In the COVID-19 dataset, which included 23 reference samples and 279 bulk samples, ReCIDE-DWLS completed the deconvolutions within 30 minutes using 50 CPUs running in parallel (Supplementary Table S1).Although it remains within an acceptable timeframe, further optimization is necessary to improve efficiency in the future.
Nonetheless, ReCIDE stands as a powerful framework for enhancing cell type deconvolution accuracy.ReCIDE-DWLS outperforms existing approaches in benchmark and real datasets, particularly excelling in estimating rare cell type proportions, positions it as a valuable tool in unraveling the complexities of cellular composition in large-scale clinical datasets, ultimately advancing our understanding of disease mechanisms and improving prognostic assessments.

Key Points
• ReCIDE introduces an innovative framework for robust estimation of cell type proportions by integrating singlereference-based deconvolutions.• ReCIDE-DWLS outperforms existing approaches in benchmark and real datasets, particularly excelling in estimating rare cell type proportions.• Application of ReCIDE-DWLS on TNBC datasets reveals correlation between subclasses of T cells and PVL cells, both of which are rare cell types, and the prognosis of TNBC patients, demonstrating its clinical relevance.

Figure 1 .
Figure 1.Workf low of ReCIDE.(a) Construction of subject-specific reference MGMs.(b) Optimization of subject-specific reference MGMs and separate deconvolutions on the target bulk data.(c) PCA dimensionality reduction and GMM clustering on the deconvolution results, yielding the final outcomes.

Figure 2 .
Figure 2. Benchmark performance of 11 deconvolution methods, including pre and post-ReCIDE optimization, across 4 scenarios.(a) Comparison of RMSE values before and after ReCIDE optimization for three single-reference-based methods (CIBERSORT, DWLS, and FARDEEP) in eight benchmark tests.(b) Average RMSE values for CIBERSORT, DWLS, and FARDEEP before and after ReCIDE optimization across eight benchmark tests.(c-f) Deconvolution accuracy assessed by RMSE values in leave-one-out, cross-dataset, multi-studies, and high-resolution scenarios, respectively.(g) Deconvolution accuracy based on RMSE values for rare cell types (constituting <2% of cells) in high-resolution tests.

Figure 3 .
Figure 3. Fitted curves comparing estimated cell type proportions with true cell type proportions across various deconvolution methods.Panels (a-c) display fitted curves for major, minor, and rare cell types, respectively, generated by different deconvolution methods.The slopes of the fitted curves are depicted in the corresponding subfigures on the right.

Figure 5 .
Figure 5.The clinical relevance of cell types deconvolved by ReCIDE-DWLS across three TNBC bulk datasets.(a) Seven cell types significantly associated with treatment response and prognosis (P-value <.05).For each cell type, boxplots compare proportions between RD (residual disease) and pCR (pathological complete response) patients in GSE164458, while Kaplan-Meier plots demonstrate survival time differences between patients with proportions greater or smaller than the median in GSE58812.(b) Cell types from (a) successfully validated in the TCGA dataset (P-value <.05), displaying prognostic curves of progression-free survival.(c) The correlation of PVL cell proportion with clinical response to therapy and prognosis in TNBC patients.(d) The correlation of PVL_immature_s1/PVL ratios with clinical response to therapy and prognosis in TNBC patients.(e) Kaplan-Meier plots derived from training on GSE58812.Prognostic scores for all patients in GSE58812 were calculated, and the score with the most significant P-value (.0125) was selected to plot the Kaplan-Meier plots (P-value = 1.83 × 10 −06 ).(f) Kaplan-Meier plots resulting from applying the prognostic score threshold established from GSE58812 to the TCGA dataset (P-value = .00034).
Data from Wu et al. and Pal et al. were the reference sets, while data from Bassez et al. were used for pseudo-bulk data.